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Abstract 

This paper evaluates the tradeoffs involved in the design of the 
software-extended memory system of Alewife, a multiprocessor ar- 
chitecture that implements coherent sharedmemory through a com- 
bination of hardware and software mechanisms. For each block of 
memory, Alewife implements between zero and five coherence di- 
rectory pointers in hardware and allows software to handle requests 
when the pointers are exhausted. The software includes a flexible 
coherence interface that facilitates protocol software implementa- 
tion. This interface is indispensable for conducting experiments 
and has proven important for implementing enhancements to the 
basic system. 

Simulations of a number of applications running on a complete 
system (with up to 256 processors) demonstrate that the hybrid 
architecture with five pointers achieves between 71% and 100% 
of full-map directory performance at a constant cost per process- 
ing element. Our experience in designing the software protocol 
interfaces and experiments with a variety of system configurations 
lead to a detailed understanding of the interaction of the hardware 
and software components of the system. The results show that a 
small amount of shared memory hardware provides adequate per- 
formance: One-pointer systems reach between 42% and 100% of 
full-map performance on our parallel benchmarks. A software- 
only directory architecture with no hardware pointers has lower 
performance but minimal cost. 



1 Introduction 

Implementing shared memory for a large-scale multiprocessor re- 
quires balancing the performance of the system as a whole with 
the complexity and cost of its hardware and software components. 
Shared memory itself helps control the complexity of the appli- 
cation software written for a machine, but it requires an efficient 
design to achieve this goal. The Alewife architecture^] uses a 
combination of hardware and software to provide shared memory 
at a constant cost per processing node, without sacrificing perfor- 
mance. Following the integrated systems approach, the architecture 
uses hardware to implement common memory accesses and uses 
software to extend the hardware by handling potentially complex 
scenarios. 

The primary contribution of this paper is the demonstration of 
a complete software-extended shared memory system that allows 
measurement of the performance of its software components. This 
system proves that the software extension approach is a viable 



alternative for implementing a shared memory system, in terms 
of both cost and performance. Rather than advocating a specific 
machine configuration, the paper seeks to examine the performance 
versus cost tradeoffs inherent in implementing software-extended 
shared memory. 

At the heart of a shared memory design lies the problem of pro- 
viding fast average access time while ensuring a coherent memory 
model. Directory-based cache coherence protocols provide an effi- 
cient implementation of coherent shared memory for large systems . 
These protocols allow each processing node to take advantage of 
typical memory access patterns by caching frequently used data, 
even when the data may be shared by other nodes. A directory 
is a structure that helps enforce the coherence of cached data by 
maintaining pointers to the locations of cached copies of each mem- 
ory block. When one node modifies a block of data, the memory 
system uses the information stored in the directory to enforce a 
coherent view of the data. Typically, directories are not monolithic 
structures but are distributed to the processing nodes along with a 
system's shared memory. 

A promising design strategy, central to the Alewife architecture, 
uses a combination of hardware and software to implement a cost- 
efficient directory [9]. Since most data blocks in a shared memory 
system are shared by a small number of processing nodes [2, 32, 8], 
the hardware can implement a small set of pointers, and provide 
mechanisms to allow the system's software to extend the directory 
when the set of pointers is insufficient for enforcing coherence. 
This software-extension technique catalyzes the balance between a 
system's performance and cost. 

LimitLESS directories, a scheme proposed in [9], is a software- 
extended coherence protocol that permits a tradeoff between the 
cost and the performance of a shared memory system. LimitLESS, 
which stands for a Limited directory, Locally Extended through 
Software Support, implements a small number of pointers in a hard- 
ware directory (zero through five in Alewife), so that the hardware 
can track a few copies of any memory block. When these point- 
ers are exhausted, the memory system hardware interrupts a local 
processor, thereby requesting it to maintain correct shared memory 
behavior by extending the hardware directory with software. 

Another set of software-extended protocols (termed DiriSW) 
were proposed in [14] and [34]. These protocols use only one 
hardware pointer, rely on software to broadcast invalidates, and 
use hardware to accumulate the acknowledgments. In addition, 
they allow the programmer or compiler to insert Check-In/Check- 
Out (CICO) directives into programs to minimize the number of 
software traps. 



All software-extended memory systems require a battery of ar- 
chitectural mechanisms to permit a designer to make the cost versus 
performance tradeoff. First, the shared memory hardware must be 
able to invoke extension software on the processor, and the pro- 
cessor must have complete access to the memory and network 
hardware[9, 19, 34]. Second, the hardware must guarantee for- 
ward progress in the face of protocol thrashing scenarios and high- 
availability interrupts [20]. 

Each processing node must also provide support for location- 
independent addressing, which is a fundamental requirement of 
shared memory. Hardware support for location-independent ad- 
dressing permits the software to issue an address that refers to an 
object without knowledge of where it is resident. This hardware 
support includes an associative matching mechanism to detect if 
the object is cached, a mechanism to translate the object address 
and identify its home location if it is not cached, and a mechanism 
to issue a message to fetch the object from a remote location if it is 
not in local memory. 

Since these mechanisms comprise the bulk of the complexity 
of a software-extended system, it is important to note that the 
benefits of these mechanisms extend far beyond the implementation 
of shared memory[17]. Alternative approaches to implementing 
shared memory proposed in [26, 30, 21] use hardware mechanisms 
that allocate directory pointers dynamically. These schemes do not 
require the mechanisms listed above, but they lack the flexibility of 
protocol and application software design. 

A number of systems rely primarily on software to implement 
the mechanisms required to support shared memory [10, 23, 11, 
6, 5, 4]. These systems implement coherent shared memory at 
low cost; however, providing location-independent addressing in 
software forces the granularity of data sharing to be much larger 
than in software-extended systems. In contrast, this paper proposes 
a software-only directory architecture, which implements location- 
independent addressing in hardware but relies on software to handle 
all inter-node memory accesses. This software-extended scheme is 
a low-cost alternative that allows threads to share small blocks of 
data. 

Unlike previous studies of software-extended schemes, this pa- 
per analyzes a complete system, whose hardware is in the final 
stages of fabrication, and whose software is fully functional. De- 
tailed simulations of the system lead to several conclusions about 
the implementation of software-extended shared memory: The 
minimum amount of shared memory hardware that is required to 
provide adequate performance is a single directory pointer (that also 
serves as an acknowledgment counter) per memory block. Beyond 
this level of hardware support, the cost and mapping of a system's 
DRAM become more important factors than performance. In ad- 
dition, processor caches should include more associativity than a 
simple direct-mapped cache. Alewife uses a victim cache[16, 20] 
to provide the required extra associativity. 

The paper extends previous work in the performance analysis 
of software-extended coherence protocols [9, 34] to a much wider 
spectrum, ranging from zero hardware pointers (the software-only 
directory architecture) to a full-map protocol, through the use of 
controlled experiments using a synthetic benchmark and a set of 
application programs. By studying the behavior of real software 
protocol handlers, this paper confirms results presented in [9] on the 
similarity in performance of LimitLESSi (one hardware pointer), 
LimifLESS2, LimitLESSi and full-map protocols. We also con- 



firm the findings in [34] that the performance of suitably tuned 
one-pointer protocols is competitive with that of multiple-pointer 
protocols. 

In order to study the software side of hybrid shared memory 
systems, this paper investigates two different software systems that 
use the same hardware to achieve different goals. One system, 
written in C, incorporates a flexible coherence interface that fa- 
cilitates protocol software implementation. This interface proved 
indispensable for conducting experiments over the whole spectrum 
of software-extended protocols. Our continuing research relies on 
the flexibility of this interface for implementing enhancements to 
the basic system. The other system, which is written in assembly- 
language, is a highly optimized and specialized implementation. 
While the specialized system supports only a narrow range of func- 
tionality, it shows the potential benefits of well-tuned software. 

This paper presents case studies that examine how application 
performance varies over the spectrum of software-extended pro- 
tocols. Before reaching these case studies, Section 2 gives an 
overview of the cost versus performance tradeoff that a software- 
extended memory system exploits and describes a notation for such 
systems. The paper's experimental methodology is then described 
in Section 3. Section 4 analyzes how the implementation of a 
software-extended protocol affects application performance; Sec- 
tion 5 examines the converse: how application characteristics deter- 
mine protocol performance. Section 6 presents the case studies of 
application performance, and Section 7 suggests enhancements to 
improve programmability and performance. Finally, Section 8 dis- 
cusses the impact of this study on the design of software-extended 
coherent memory systems. 



2 A spectrum of protocols 

The number of directory pointers that are implemented in hardware 
is an important design decision involved in building a software- 
extended shared memory system. More hardware pointers mean 
fewer situations in which a system must rely on software to enforce 
coherence, thereby increasing performance. Having fewer hard- 
ware pointers means a lower implementation cost, at the expense 
of reduced performance. This tradeoff suggests a whole spectrum 
of protocols, ranging from pointers to n pointers, where n is the 
number of nodes in the system. 



2.1 The n pointer protocol 

The full-map protocol[7], which is implemented in the DASH 
multiprocessor[21], uses n pointers for every block of memory 
in the system and requires no software extension. Although this 
protocol permits an efficient implementation that uses only one bit 
for each pointer, the sheer number of pointers makes it extremely 
expensive for systems with large numbers of nodes. Despite the 
cost of the full-map protocol, it serves as a good performance goal 
for the software-extended schemes. 



2.2 2^ 



n 



1) pointer protocols 



There is a range of protocols that use a software-extended coherence 
scheme to implement shared memory. It is this range of protocols 



that allows the designer to trade hardware cost and system perfor- 
mance. From the point of view of implementation complexity, the 
protocols that implement between 2 and n — 1 pointers in hardware 
are homogeneous. Of course, the n — 1 pointer protocol would be 
even more expensive to implement than the full-map protocol, but it 
still requires exactly the same hardware and software mechanisms 
as the protocols at the other end of the spectrum. 

The protocol extension software needs to service only two kinds 
of messages: read and write requests. It handles read requests by 
allocating an extended directory entry (if necessary), emptying all 
of the hardware pointers into the software structure, and recording a 
pointer to the node that caused the directory overflow. Subsequent 
requests may be handled by the hardware until the next overflow 
occurs. For all of these protocols, the hardware returns the appro- 
priate data to requesting nodes; the software only needs to record 
requests that overflow the hardware directory. 

To handle write requests after an overflow, the software transmits 
invalidation messages to every node with a pointer in the hardware 
directory or in the software directory extension. The software 
then returns the hardware directory to a mode that collects one 
acknowledgment message for each transmitted invalidation. 

2.3 Zero-pointer protocols 

Since the software-only directory[28] has no directory memory, it 
requires substantially different software than the 2 <-► (n — 1) range 
of protocols. This software must implement all coherence protocol 
state transitions for inter-node accesses. 

While other implementations are possible, our version of the 
zero-pointer protocol uses one extra bit per memory block to opti- 
mize the performance of purely intra-node accesses: the bit indi- 
cates whether the associated memory block has been accessed at 
any time by a remote node. When the bit is clear (the default value), 
all memory accesses from the local processor are serviced without 
software traps, just as in a uniprocessor. When an inter-node re- 
quest arrives, the bit is set and the extension software flushes the 
block from the local cache. Once the bit is set, all subsequent ac- 
cesses — including intra-node requests — are handled by software 
extension. 



2.4 One-pointer protocols 

The one-pointer protocols are a hybrid of the protocols discussed 
above. This paper studies three variations of this class of protocols. 
All three use the same software routine to transmit data invalidations 
sequentially, but they differ in the way that they collect the messages 
that acknowledge receipt of the invalidations. The first variation 
handles the acknowledgments completely in software, requiring a 
trap from the hardware upon the receipt of each message. During 
the invalidation/acknowledgment process, the hardware pointer is 
unused. 

The second protocol handles all but the last of a sequence of 
acknowledgments in hardware. If a node transmits 64 invalida- 
tions, then the hardware will process the first 63 invalidations. This 
variation uses the hardware pointer to store a count of the number 
of acknowledgments that are still outstanding. During this pro- 
cess, the hardware will also transmit busy messages to requesting 
nodes, eliminating the livelock problem. Upon receiving the 64 th 



acknowledgment, the hardware invokes the software, which takes 
care of transmitting data to the requesting node. 

The third protocol handles all acknowledgment messages in 
hardware. This protocol actually requires storage for two hard- 
ware pointers: one pointer to store the requesting node's identifier 
and another to count acknowledgments . Although a designer would 
always choose to implement a two-pointer protocol over this vari- 
ation of the one-pointer protocol, it still provides a useful baseline 
for measuring the performance of the other two variations. 

2.5 A notation for the spectrum 

We now introduce a notation that allows us to articulate clearly 
the differences between various implementations and facilitates a 
precise cost comparison. 

Our notation is derived from a nomenclature for directory-based 
coherence protocols introduced in [2]. In the previous notation, a 
protocol was represented as Dir 8 X, where i represented the num- 
ber of explicit copies tracked, and X was B or NB depending 
on whether or not the protocol issued broadcasts. Notice that this 
nomenclature does not distinguish between the functionality imple- 
mented in the software and in the hardware. Our notation attempts 
to capture the spectrum of features of software-extended proto- 
cols that have evolved over the past several years, and previously 
termed LimifLESSi, LimifLESS4, and others in [9], and DiriSW, 
DinSW+, and others in [14, 34]. 

For both hardware and software, our notation divides the mech- 
anisms into two classes: those that dictate directory actions upon 
receipt of processor requests, and those that dictate directory ac- 
tions for acknowledgments. 

Accordingly, our notation specifies a protocol as: DiriHxSy.A, 

where i is the number of explicit pointers recorded by the system - 
in hardware or in software - for a given block of data. 

The parameter X is the number of pointers recorded in a hard- 
ware directory when a software extension exists. X is NB if the 
number of hardware pointers is i and no more than i shared copies 
are allowed, and is B if the number of hardware pointers is i and 
broadcasts are used when more than i shared copies exist. Thus the 
full-map protocol in DASH [21] is termed Dir n H^BS-- 

The parameter Y is NB if the hardware-software combination 
records i explicit pointers and allows no more than i copies. Y is 
B if the software resorts to a broadcast when more than i copies 
exist. 

The A parameter is ACK if a software trap is invoked on every 
acknowledgment. A missing A field implies that the hardware 
keeps an updated count of acknowledgments received. Finally, the 
A parameter is LACK if a software trap is invoked only on the last 
acknowledgment. 

According to this notation, the LimifLESSi protocol defined 
in [9] is termed Dir„HiStfB, denoting that it records n pointers, of 
which only one is in hardware. The hardware handles all acknowl- 
edgments and the software issues invalidations to shared copies 
when a write request occurs after an overflow. This paper deals 
with three variants of the one-pointer protocols defined above. In 
our notation, the three one-pointer protocols are Dir n H\S^B,ACK^ 
Dir n H\SNB,LACKi and Dir n HiSNB, respectively. 

The set of software-extended protocols introduced in [14] 



and [34] can also be expressed in terms of our notation. The 
DiriSW protocol maintains one pointer in hardware, resorts to 
software broadcasts when more than one copy exists, and counts 
acknowledgments in hardware. In addition, their protocol traps 
into software on the last acknowledgment[33]. In our notation, 
this protocol is represented as Dir\H\Ss,LACK- This protocol is 
different from the Dir n H\S^B,LACK protocol in that Dir\H\SB,LACK 
maintains only one explicit pointer, while Dir n H\S^B,LACK main- 
tains one pointer in hardware and extends the directory to n point- 
ers in software. An important consequence of this difference is 
that the Dir n H\S^B,LACK potentially traps on read requests, while 
DinHiS B ,LACK does not. Unlike Dir n H x S NB ,LACK, Dir\H\S BiLACK 
must issue broadcasts on write requests to memory blocks that are 
cached by multiple nodes. 



3 Methodology 



This section first describes the MIT Alewife machine, which pro- 
vides a proof of concept for software-extended memory systems 
and a platform for experimenting with many aspects of multipro- 
cessor design and programming. While the machine supports an 
interesting range of protocols, it does not implement the full spec- 
trum of software-extended schemes that this paper evaluates. Only 
a simulation system can provide the range of protocols, the de- 
terministic behavior, and the non-intrusive observation functions 
that are required for analyzing the spectrum of software-extended 
protocols. The second half of this section describes the simulation 
system used to do the experiments discussed in the remainder of 
the paper. 

3.1 The Alewife Machine 

Alewife is a large-scale multiprocessor with distributed shared 
memory. Figure 1 shows an enlarged view of a node in the Alewife 
machine. Each node consists of a 33 MHz Sparcle processor[l], 
64K bytes of direct-mapped cache, 4 Mbytes of globally-shared 
main memory, and a floating-point coprocessor. The nodes commu- 
nicate via messages through a network[29] with a mesh topology. A 
single-chip communications and memory management (CMMU) 
on each node holds the cache tags and implements the memory 
coherence protocol by synthesizing messages to other nodes. All 
of the node components, with the exception of the CMMU, have 
been fabricated and tested. The CMMU is in fabrication. 

In order to provide a platform for shared memory research, 
Alewife supports dynamic reconfiguration of coherence protocols 
on a block-by-block basis. The machine supports Dir n HoS^B,ACK^ 
DhnHiSNB, Dir n HiSNB, Dir n H^SuB^ Dir n H 5 SNB, Dir 5 H s SB and a 
variety of other protocols. The node diagram in Figure 1 illustrates 
a memory block with two hardware pointers and an associated 
software-extended directory structure (Dir„H2SNB)- The current 
default boot sequence configures every block of shared memory 
with a Dir„HsSNB protocol, which uses all of the available hardware 
pointers. 

In addition to the standard hardware pointers, Alewife imple- 
ments a special one-bit pointer for the node that is local to the 
directory. Several simulations show that this extra pointer im- 
proves performance by only about 2%. Its main benefit lies in 
reducing the complexity of the protocol hardware and software by 




Alewife machine 



Figure 1 : Alewife node, with a Dir n H2SrjB memory block. 

eliminating the possibility that a node will cause its local hardware 
directory to overflow. For the results presented in this paper, all 
of the protocols (except Dir n HoS^B,ACK) use me one-bit pointer in 
addition to the normal hardware pointers. 

3.2 NWO: the Alewife simulator 

NWO is a multi-purpose simulator that provides a deterministic 
debugging and test environment for the Alewife machine. The 
simulator performs a cycle-by-cycle simulation of all of the com- 
ponents in Alewife. NWO is binary compatible with Alewife's 
hardware: programs that run on the simulator will be able to run 
on the actual machine without recompilation. The CMMU pro- 
tocol state-transition tables are automatically compiled from the 
hardware specification into a simulator executable format, so that 
NWO incorporates the hardware protocol directly. NWO models 
the Alewife data paths accurately enough that it is used to drive the 
transistor-level simulations of the CMMU. Although Alewife does 
not support one-pointer protocols or protocols with more than five 
hardware pointers, NWO has been extended to support a complete 
spectrum of software-extended protocols, from Dir„HoS^B,ACK to 
Dir n H NB S- . 

There are two inaccuracies in the simulation. First, NWO does 
not model the Sparcle or FPU pipelines, even though it does model 
many of the pipelined data paths within the CMMU. Second, 
NWO models communication contention at the CMMU network 
transmit and receive queues, but does not model contention within 
the network switches. 

The initial implementation of NWO targeted SPARC and MIPS- 
based workstations; we have also developed a version of the sim- 
ulator that runs on Thinking Machines' CM-5 multiprocessors. In 
the latter implementation, each CM-5 node simulates the processor, 
memory, and network hardware of one or more Alewife nodes . The 
CM-5 port of our simulator has proved invaluable, especially for 
running simulations of 64 and 256 node Alewife systems. 



4 Software interfaces 

Two different versions of the protocol extension software have been 
written for the Alewife machine. One version incorporates a flexi- 
ble coherence interface that allows rapid protocol implementation 
in the C programming language. The other version is a set of pro- 
tocol handlers that are implemented in assembly language and are 
carefully tuned to take full advantage of the features of the Alewife 
architecture. 



4.1 Protocol software implementations 

The C version of the Alewife protocol extension software imple- 
ments the whole range of software-extended protocols within a 
flexible framework. A single set of C routines implements all of 
the protocols from Dir n H2SrjB to DirnH^gS-- Other modules 
linked into the same kernel support Dir n HoS^B,ACK^ Dir„HiSNB, 
Dir n HiS NBtLACK , and Dir n HiS NBAC K- 

The flexible interface facilitates the construction of all of these 
protocols by providing C macros for hardware directory manipu- 
lation, protocol message transmission, a free-listing memory man- 
ager, and hash table administration. The interface eliminates the 
need for the protocol designer to understand many of the details 
of the Alewife hardware implementation. For example, the proto- 
col interface sets up an environment that lets the protocol designer 
treat every protocol event as if it were generated by an asynchronous 
inter-node request. 

In addition, the framework hides other implementation details 
such as atomic protocol transitions and livelock situations. The 
framework ensures the atomicity of protocol transitions by guaran- 
teeing that asynchronous messages in a CMMU internal queue are 
processed before handling synchronous events. Livelock situations 
can occur when protocol software-extension requests occur so fre- 
quently that user code cannot make forward progress. The frame- 
work solves this problem by using a timer interrupt to implement 
a watchdog that detects possible livelock, temporarily shuts off 
asynchronous events, and allows the user code to run unmolested. 
In practice, such conditions happen only for Dir n HoS^B,ACK an d 
Dir n H\SNB,ACK> when they handle acknowledgments in software. 
The framework provides a very simple interface that allows these 
protocols to invoke the watchdog directly. 

During the course of the study reported in this paper, the flexible 
coherence interface proved itself to be an indispensable tool for 
rapidly prototyping a complete set of protocols. The framework 
is currently being used to implement some of the enhancements to 
the basic protocols that are described in Section 7. 

Unfortunately, flexibility comes at the cost of performance. All 
of the mechanisms that protect the protocol designer from the de- 
tails of the Alewife hardware implementation increase the time that 
it takes to handle protocol requests in software. An assembly- 
language version of the Alewife protocol extension handlers helps 
investigate the performance versus flexibility tradeoff by optimiz- 
ing the performance of the software. Since this approach re- 
quires a large programming effort, this version only implements 
Dir„H 5 SNB- 

The code for this optimized version is hand-tuned to keep in- 
struction counts to a minimum. To reduce memory management 
time, it uses a special free-list of extended directory structures that 



are initialized when the kernel boots the machine. The assembly- 
language version also takes advantage of a feature of Alewife's 
directory that eliminates the need for a hash table lookup. 

4.2 Comparing the implementations 

The primary difference between the performance of the two imple- 
mentations of the protocol extension software is the amount of time 
that it takes to process a protocol request. Table 1 gives the average 
number of cycles required to process Dir„HsSNB read and write 
requests for both of the implementations. These software handling 
latencies were measured by running the WORKER benchmark (de- 
scribed in the next section) on a 16 node system. The latencies are 
relatively independent of the number of nodes that read each mem- 
ory block. In most cases, the hand-tuned version of the software 
reduces the latency of protocol request handlers by about a factor 
of two. 
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Table 1 : Average software extension latencies for C and for as- 
sembly language, in execution cycles. 



These latencies may be understood better by analyzing the num- 
ber of cycles spent on each activity required to extend a protocol in 
software. Table 2 accounts for all of the cycles spent in a read and 
a write request from both versions of the protocol software. These 
counts come from cycle-by-cycle traces of read and write requests 
with eight readers and one writer per memory block. In order to 
select a representative individual from each sample, we choose a 
median request of each type (as opposed to the average, which we 
use above to summarize aggregate behavior). 

The dispatch and trap return activities are standard sequences 
of code that invoke hardware exception and interrupt handlers and 
allow them to return to user code, respectively. (The dispatch 
activity does not include the three cycles that Sparcle takes to flush 
its pipeline and to load the first trap instruction.) In the assembly- 
language version, these sequences are streamlined to invoke the 
protocol software as quickly as possible. The C implementation of 
the software requires an extra protocol-specific dispatch in order 
to set up the C environment and hide the details of the Alewife 
hardware. For the types of protocol requests that occur when 
running the WORKER benchmark, this extra overhead does not 
significantly impact performance. The extra code in the C version 
that supports the non-Alewife protocols implemented only in the 
simulator also impacted performance minimally. 

The difference between the performance of the C and assembly- 
language protocol handlers lies in the flexibility of the C interface. 
The assembly-language version avoided most of the expense of 
memory management and hash table administration by implement- 
ing a special-purpose solution to the directory structure allocation 
and lookup problem. This solution relies heavily on the format 
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Table 2: Breakdown of execution cycles measured from median-latency read and write requests. Each memory block has 8 readers and 1 
writer. N/A stands for not applicable. 



of Alewife's coherence directory and is not robust in the context 
of a system that runs a large number of different applications over 
a long period of time. However, it does place a minimum bound 
on the time required to perform these tasks. As the Alewife sys- 
tem evolves, critical pieces of the protocol extension software that 
are implemented under the flexible interface will be hand-tuned to 
realize the best of both worlds. 



5 Worker sets and performance 

A worker set is defined to be the set of nodes that simultaneously 
access a unit of data. The software-extension approach is predicated 
on the observation that, for a large class of applications, most 
worker sets are relatively small. Small worker sets are handled in 
hardware by a limited directory structure. Memory blocks with 
large worker sets must be handled in software, at the expense of 
longer memory access latency and processor cycles that are spent 
on protocol handlers rather than on user code. 

This section uses a synthetic benchmark to investigate the rela- 
tionship between an application's worker sets and the performance 
of software-extended coherence protocols. The benchmark, called 
WORKER, uses a data structure that creates memory blocks with 
an exact worker set size. WORKER consists of an initialization 
phase that builds the worker set data structure and a number of 
iterations that perform repeated memory accesses to the structure. 

The nodes begin each iteration by reading the appropriate slots in 
the worker set structure. After the reads, they execute a barrier and 
then perform the writes to the structure. Finally, the nodes execute 
a barrier and continue with the next iteration. Every read request 
causes a cache miss and every write request causes a directory 
protocol to send exactly one invalidation message to each reader. 
This completely deterministic memory access pattern provides a 
controlled experiment for comparing the performance of different 
protocols. 

In order to analyze the relationship between worker set sizes and 
the performance of software-extended shared memory, we perform 
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Figure 2: Protocol performance and worker set size. 

simulations of WORKER running on a range of protocols. The 
simulations are restricted to a relatively small system because the 
benchmark is both regular and completely distributed, so the results 
would not be qualitatively different for a larger number of nodes. 

Figure 2 presents the results of a series of 16 node simulations. 
The horizontal axis gives the size of the worker sets generated by the 
benchmark. The vertical axis measures the ratio of the run-time of 
each protocol and the run-time of a full-map protocol (Dir n HfjBS-) 
running the same benchmark configuration. 

The solid curves in Figure 2 indicate the performance of some 
of the protocols that are implemented in the actual Alewife ma- 
chine. As expected, the more hardware pointers, the better the 
performance of the software-extended system. The performance 
of Dir„HsSNB is particularly easy to interpret: its performance is 



exactly the same as the full-map protocol up to a worker set size 
of 4, because the worker sets fit entirely within the hardware direc- 
tory. For small worker set sizes, software is never invoked. The 
performance of Dir„HsSNB drops as the worker set size grows, due 
to the expense of handling memory requests in software. 

At the other end of the performance scale, the Dir n HoS^B,ACK 
protocol performs significantly worse than the other protocols, for 
all worker set sizes. Since WORKER is a shared memory stress 
test and exaggerates the differences between the protocols, Figure 2 
shows the worst possible performance of the software-only direc- 
tory. The measurements in the next section, which experiment with 
more realistic applications, yield a more optimistic outlook for the 
zero and one-pointer protocols. 

The dashed curves correspond to one-pointer protocols that run 
only in the simulation environment. These three protocols dif- 
fer only in the way that they handle acknowledgment messages 
(see Section 2.4). For all non-trivial worker set sizes, the proto- 
col that traps on every acknowledgment message {Dir n H\S^B,ACK) 
performs significantly worse than the protocols that can count ac- 
knowledgments in hardware. Dir„H\SNB, which never traps on 
acknowledgment messages, has very similar performance to the 
Dir„H2SNB protocol, except when running with size 1 worker sets. 
Since this version of Dir n H\S^B requires the same amount of di- 
rectory storage as Dir„H2SNB, the similarity in performance is not 
surprising. 

Of the three different one-pointer protocols, the protocol that 
traps only on the last acknowledgment message in a sequence 
(Dir n H\S^B,LACK) makes the most cost-efficient use of the hard- 
ware pointers. This efficiency comes at a slight performance cost. 
For the WORKER benchmark, this protocol performs between 0% 
and 50% worse than Dir n H\S^B. When the worker set size is 
4 nodes, Dir n H[S^B,LACK actually performs slightly better than 
Dir„HiSNB- This anomaly is due to a memory-usage optimization 
that attempts to reduce the size of the software-extended directory 
when handling small worker sets. The optimization, implemented 
in the Dir n HiS NB ,LACK, Dir n HiS NB ,ACK and Dir n H S NBrA CK pro- 
tocols, improves the run-time performance of all three protocols 
for worker set sizes of 4 or less. 



6 Application case studies 

This section presents more practical case-studies of several pro- 
grams and investigates how the performance of applications de- 
pends on memory access patterns, the coherence protocol, and 
other machine parameters. 

The names and characteristics of the applications we analyze 
are given by Table 3. They are written in C, Mul-T[18] (a parallel 
dialect of LISP), and Semi-C[15] (a language akin to C with sup- 
port for fine-grain parallelism). Each application (except MP3D) is 
studied with a problem size that realizes more than 50% processor 
utilization on a simulated 64 node machine with a full-map direc- 
tory. All performance results use protocols implemented with the 
flexible coherence interface described in Section 4. 

Figure 4 presents the basic performance data for the applications 
running on 64 nodes. The horizontal axis shows the number of di- 
rectory pointers implemented in hardware, thereby measuring the 
cost of the system. The vertical axis shows the speedup of the mul- 



Name 


Language 


Size 


Sequential 


TSP 


Mul-T 


10 city tour 


1.1 sec 


AQ 


Semi-C 


see text 


0.9 sec 


SMGRID 


Mul-T 


129 x 129 


3.0 sec 


EVOLVE 


Mul-T 


12 dimensions 


1.3 sec 


MP3D 


C 


10,000 particles 


0.6 sec 


WATER 


C 


64 molecules 


2.6 sec 



Table 3: Characteristics of applications. Sequential time assumes 
a clock speed of 33MHz. 
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Figure 3: TSP: detailed 64 node performance analysis. 

tiprocessor execution over a sequential run without multiprocessor 
overhead. The software-only directory is always on the left and the 
full-map directory on the right. All of the figures in this section 
show Dir n HiS^B,ACK performance for the one-pointer protocol. 

The most important observation is that the performance of 
Dir„HsSNB is always between 71% and 100% of the performance 
of Dir n HfjBS-- Thus, the data in Figure 4 provides strong evi- 
dence that the software extension approach is a viable alternative 
for implementing a shared memory system. The rest of this section 
seeks to provide a more detailed understanding of the performance 
of software-extended systems. 

Traveling Salesman Problem TSP solves the traveling sales- 
man problem using a branch-and-bound graph search. The applica- 
tion is written in Mul-T and uses the future construct to specify 
parallelism. In order to ensure that the amount of work performed 
by the application is deterministic, we seed the best path value 
with the optimal path. Given the characteristics of the application's 
memory access pattern, one would expect TSP to perform well with 
a software-extended protocol: the application has very few large 
worker sets. In fact, most - but not all - of the worker sets are small 
sets of nodes that concurrently access partial tours. 

Figure 3 presents detailed performance data for TSP running 
on a 64 node machine. Contrary to our expectations, TSP suffers 
severe performance degradation when running with the software- 
extended protocols. The gray bars in the figure show that the 
five-pointer protocol performs more that 3 times worse than the full- 
map protocol. This performance decrease is due to instruction/data 
thrashing in Alewife's combined, direct-map caches: When we 
profiled the address reference pattern of the application, we found 
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Figure 4: Application speedups over sequential, running on 64 nodes. Horizontal axis shows number of hardware pointers. Vertical axis 
shows speedup over sequential execution. 



that two memory blocks that were shared by every node in the 
system were constantly replaced in the cache by commonly run 
instructions. 

In order to confirm this observation, we invoked a simulator op- 
tion that allows one-cycle access to every instruction without using 
the cache. This option, called perfect if etch, eliminates the effects 
of instructions on the memory system. The hashed bars in Figure 3 
confirm that instruction/data thrashing was a serious problem in the 
initial runs. Absent the effects of instructions, all of the protocols 
except the software-only directory realize performance equivalent 
(within experimental error) to a full-map protocol. 

While perfect instruction fetching is not possible in real systems, 
there are various methods for relieving instruction/data thrashing 
by increasing the associativity of the cache system. Alewife's 
approach to the problem is to implement a version of victim 
caching[16], which uses the transaction store[20] to provide a small 
number of buffers for storing blocks that are evicted from the cache. 
The black bars in Figures 4(a) and 3 show the performance for TSP 
on a system with victim caching enabled. The few extra buffers 
improve the performance of the full-map protocol by 16%, and 
allow all of the protocols with hardware pointers to perform about 
as well as full-map. For this reason, the studies of all of the other 
applications in this section enable victim-caching by default. 

It is interesting to note that Dir n HoS^B,ACK with victim caching 
achieves almost 70% of the performance ofDir n H^BS- ■ This low- 
cost alternative seems viable for applications with limited amounts 
of sharing. 



Thus far, we have compared the performance of the protocols 
under an environment where the full-map protocol achieves close 
to maximum speedup. On an application that requires only 1 
second to run, the system with victim caching achieves a speedup 
of about 55 for the 5 pointer protocol. In order to investigate the 
effects of running an application with suboptimal speedups, we 
ran the same problem size on a 256 node machine with victim 
caching enabled. Figure 5 shows the results, which indicate a 
speedup of 142 for full-map and 1 34 for five-pointers. We consider 
these speedups remarkable for this problem size and note that the 
software-extended system performs only 6% worse than full-map in 
this configuration. The difference in performance is due primarily 
to the increased contribution of the transient effects over distributing 
data to 256 nodes at the beginning of the run. 

Adaptive Quadrature AQ performs numerical integration of 
bivariate functions using adaptive quadrature. The core of the 
algorithm is a function that integrates the range under a curve by 
recursively calling itself to integrate sub-ranges of that range. The 
function used for this study is x 4 j/ 4 , which is integrated over the 
square ((0, 0), (2, 2)) with an error tolerance of 0.005. 

Since all of the communication in the application is producer- 
consumer, we expect this application to perform equally well for all 
protocols that implement at least one directory pointer in hardware. 
Figure 4(b) confirms this expectation by showing the performance 
of the application running on 64 nodes. Again, Dir n HoS^B,ACK 
performs respectably due to the favorable memory access patterns 
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Figure 5: TSP running on 256 nodes, 
in the application. 

Static Multigrid SMGRID uses the multigrid method to solve 
elliptical partial differential equations [13]. The algorithm consists 
of performing a series of Jacobi-style iterations on multiple grids 
of varying granularities. The speedup over sequential is limited 
by the fact that only a subset of nodes work during the relaxation 
on the upper levels of the pyramid of grids. Furthermore, data is 
more widely shared in this application than in either TSP or AQ. 
The consequences of these two factors appear in Figure 4(c): the 
absolute speedups are lowerthan either of the previous applications, 
even though the sequential time is three times longer. 

The larger worker set sizes of multigrid cause the performance 
of the different protocols to separate. Dir n HoS^B,ACK performs 
more than three times worse than the full-map protocol. The others 
range from 25% worse in the case of Dir n H\S^B,ACK t0 6% worse 
in the case of Dir n H 5 SNB- 

Genome Evolution EVOLVE is a graph traversal algorithm for 
simulating the evolution of genomes, which is reduced to the prob- 
lem of traversing a hypercube and finding local and global maxima. 
The application searches for a path from the initial conditions to a 
local fitness maximum. 

Of all of the applications in Figure 4, EVOLVE causes 
Dir„HsSNB to exhibit the worst performance degradation compared 
to Dir n HfjBS- '■ the worker sets of EVOLVE seriously challenge a 
software-extended system. Figure 6 shows the number of worker 
sets of each size at the end of a 64 node run. Note that the vertical 
axis is logarithmically scaled: there are almost 10,000 one-node 
worker sets, while there are 25 worker sets of size 64. The sig- 
nificant number of nontrivial worker sets implies that there should 
be a sharp difference between protocols with different numbers of 
pointers. The large worker sets sizes impact the and 1 pointer 
protocols most severely. Thus, EVOLVE provides a good example 
of a program that can benefit from a system's hardware directory 
pointers. 

MP3D The MP3D application is part of the SPLASH parallel 
benchmark suite [31]. For our simulations, we use a problem size of 
10,000 particles, turn the locking option off, and augment the stan- 
dard p4 macros with Alewife's parallel C library[25]. Since this 
application is notorious for exhibiting low speedups [22], the results 
in Figure 4(e) are encouraging: Dir n HfjBS- achieves a speedup of 



Figure 6: Histogram of worker set sizes for EVOLVE, running on 
64 nodes. 

24 and Dir„HsSNB realizes a speedup of 20. These speedups are for 
a relatively small problem size, and we expect absolute speedups 
to increase with problem size. 

The software-only directory exhibits the worst performance 
(only 11% of the speedup of full-map) on MP3D. Thus, MP3D 
provides another example of an application that can benefit from at 
least a small number of hardware directory pointers. 

Water The Water application, also from the SPLASH applica- 
tion suite, is run with 64 molecules. In addition to the p4 macros, 
this version of Water uses Alewife's parallel C library for barriers 
and reductions. Figure 4(f) shows that all of the software-extended 
protocols provide good speedups for this tiny problem size. Once 
again, the software-only directory offers almost 70% of the perfor- 
mance of the full-map directory. 



7 Enhancement opportunities 

This paper uses a basic definition of software-extended coherent 
shared memory in order to analyze the viability of the approach. 
The Dir,HxSy,A notation itself implies a straightforward software 
directory extension. While this definition allows for a relatively 
simple analysis technique, the true power of the software-extension 
approach lies in deviating from the basic implementation. The 
generality of the flexible coherence interface described in Section 4 
provides a platform for experimenting with schemes that enhance 
the performance and the functionality of the base protocols. 

[9] suggests several extensions to the basic software such as a 
FIFO lock data type. To date, the protocol extension software has 
been used to implement a FIFO lock data type, stack overflow ex- 
ceptions, and a fast barrier implementation. These enhancements 
are aimed at providing efficient functions that improve the pro- 
grammability of the machine. [9] also indicates that LimifLESS 
software could be enhanced to improve the performance of nor- 
mal shared memory variables, such as variables with large worker 
sets. The following types of extensions give examples of current 
research: 

Program and compiler annotations Program annotations 
allow a programmer to give the system information about the way 
that an application interacts with shared memory. [14] and [34] 
propose and evaluate this method for improving the performance 



of software-extended shared memory. The studies show that given 
appropriate annotations, a large class of applications can perform 
well on Dir\H\SB,LACK- [24] demonstrates a compiler annotation 
scheme for optimizing the performance of protocols that dynami- 
cally allocate directory pointers. 

Dynamic detection [12] and [27] propose a hardware mecha- 
nism that dynamically adapts to migratory data. Protocol extension 
software could perform similar optimizations. In addition, there 
are some classes of data that create severe performance bottlenecks. 
These classes tend to be the result of a simplistic programming style 
or a performance bug. Examples of these widely-shared data struc- 
tures include synchronization objects, work queues, and frequently- 
written global objects. Preliminary results from our experiments 
show that protocol extension software may improve performance 
for this type of data by dynamically selecting sequential or parallel 
invalidation procedures. 

Profile, detect, and optimize Some types of data do not create 
serious performance bottlenecks, but can benefit from optimization. 
An example of this class of data is widely-shared, read-only data. 
During the development phase of an application, enhanced protocol 
software could be used in a profiling mode to detect the existence of 
read-only data. The system could use the information to optimize 
the production version of the application. 
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Data specific Some types of data might be hard to optimize 
automatically, either dynamically or statically. In this case, a user 
could select special coherence types from a library, or even write 
an application-specific protocol under the flexible coherence inter- 
face. 



8 Conclusions 

The software extension approach offers a cost-efficient method 
for implementing scalable, coherent, high-performance shared- 
memory. Experience with the design of such a system shows 
that a minimum of one directory pointer and an acknowledgment 
counter should be implemented in hardware. Since all of the proto- 
cols that implement small numbers of hardware directory pointers 
have similar performance, factors such as the cost and mapping 
of each node's DRAM will dominate performance considerations 
when building a software-extended system. 

The hardware components of a software-extended system must 
be tuned carefully to achieve high performance. Since the software- 
extended approach increases the penalty of cache misses, thrashing 
situations cause particular concern. Adding extra associativity to 
the processor side of the memory system, by implementing vic- 
tim caches or by building set-associative caches, can dramatically 
decrease the effects of thrashing on the system as a whole. 

Experiments with the implementation of protocol software indi- 
cate that such systems should include a flexible coherence interface 
that facilitates the implementation of specialized protocols. Such 
protocols could enhance the basic protocol software to improve both 
the programmability of machines and the performance of shared 
memory. 
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